ABSTRACT

Here, we present LNCipedia (http://www.lncipedia.org), a novel database
for human long non-coding RNA (lncRNA) transcripts and genes. LncRNAs
constitute a large and diverse class of non-coding RNA genes. Although
several lncRNAs have been functionally annotated, the majority remain to
be characterized. Different high-throughput methods to identify new
lncRNAs (including RNA sequencing and annotation of chromatin-state maps)
have been applied in various studies, resulting in multiple unrelated
lncRNA data sets. LNCipedia offers 21 488 annotated human lncRNA
transcripts obtained from different sources. In addition to basic
transcript information and gene structure, several statistics are
determined for each entry in the database, such as secondary structure
information, protein coding potential and microRNA binding sites. Our
analyses suggest that, much like microRNAs, many lncRNAs have a
significant secondary structure, in line with their presumed association
with proteins or protein complexes. Available literature on specific
lncRNAs is linked, and users or authors can submit articles through a web
interface. Protein coding potential is assessed by two different
prediction algorithms: Coding Potential Calculator and HMMER. In
addition, a novel strategy has been integrated for detecting potentially
coding lncRNAs by automatically re-analysing the large body of publicly
available mass spectrometry data in the PRIDE database. LNCipedia is
publicly available and allows users to query and download lncRNA
sequences and structures based on different search criteria. The database
may serve as a resource to initiate small- and large-scale lncRNA
studies. As an example, the LNCipedia content was used to develop a
custom microarray for expression profiling of all available lncRNAs.

INTRODUCTION 

Long non-coding RNAs (lncRNAs) constitute a recently discovered class of
non-coding RNAs that has grown drastically in size during the past few
years. LncRNA genes give rise to long (>200 bp) and often multiexonic
transcripts that are presumed not to be translated into protein, as
commonly assessed by means of in silico prediction algorithms (1). In
comparison with their protein-coding counterparts, lncRNA genes are
poorly conserved (2) and are more numerous in biologically complex
species (3). Although only a fraction of the lncRNA genes has been
characterized experimentally, lncRNAs seem to function as transcriptional
regulators through direct interaction with chromatin-modifying proteins
and transcription factors (1,4,5).

LncRNAs with experimentally validated functions or 
expression patterns have been named accordingly. 
Notable examples are XIST (X inactive-specific transcript) 
(6), HOTAIR (HOX transcript antisense RNA) (7) and 
HULC (highly up-regulated in liver cancer) (8). The 
HUGO Gene Nomenclature Committee currently uses
several schemes to name lncRNAs with an unknown
function. LncRNAs that reside on the opposite strand to
(antisense) or in an intron of (intronic) a protein-coding
gene are named after the protein-coding gene with the suffixes
'-AS' and '-IT', respectively. Intergenic lncRNAs are
numbered and get the prefix 'LINC' (9).

Recent advances in non-coding RNA research have led to the creation of
several lncRNA resources. LncRNAdb focuses on lncRNA transcripts with
well-described functions in literature (10), whereas the ncRNA database
(ncRNAdb) provides RNA sequences and annotation from different sources
(11). The NONCODE database (12) contains a larger collection of human
long non-coding RNAs (33 829) obtained from different sources and by
different experimental procedures (13). Rfam provides structures and
annotation of well-known RNA families along with predictions of new
members of these families (14). However, it does not provide information
for an individual lncRNA. Although each of these resources provides
valuable information, database unification and integration of lncRNA
transcript sequence details with a broad set of bioinformatics tools and
a universal lncRNA gene building and naming scheme are currently lacking.
Here, we present LNCipedia, a catalogue of 21 488 lncRNA transcripts that
were clustered into genes, named accordingly and analysed using multiple
bioinformatics tools, revealing insights into lncRNA structure,
experimentally verified (lack of) protein coding potential, function and
regulation. We believe such a database facilitates human lncRNA research
and communication among scientists.



DATABASE DEVELOPMENT 

The sources used in the data collection step are listed in Table 1. The
most recent version of each source at the time of development has been
included. The sequences and annotations are extracted and stored in a
MongoDB database using custom Perl scripts. For this purpose, import
scripts for different file formats, such as FASTA, BED and GFF, have been
developed. Redundant transcripts are grouped in a single record, while
maintaining all annotation from the original sources. The web interface
for LNCipedia is built using the Mojolicious Perl web framework and
offers different ways of querying the data (Figure 1). LNCipedia will be
updated when newer versions of the lncRNA sources are released or if new
sources become available. In addition, researchers are encouraged to
submit new transcript sequences or annotations through lncipedia.org.
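
The grouping of redundant transcripts into a single record can be illustrated with a minimal Python sketch. It is not the authors' Perl import code. The redundancy criterion used here (identical chromosome, strand and exon coordinates) is an assumption, as the text does not define it, and the record fields and the function name `merge_redundant` are hypothetical.

```python
def merge_redundant(records):
    """Group transcripts with identical coordinates into one record.

    Assumption: two transcripts are redundant when chromosome, strand and
    all exon coordinates are identical. Names and sources of every input
    record are kept, so no original annotation is lost.
    """
    merged = {}
    for rec in records:
        exons = tuple(tuple(e) for e in rec["exons"])
        key = (rec["chrom"], rec["strand"], exons)
        entry = merged.setdefault(key, {
            "chrom": rec["chrom"],
            "strand": rec["strand"],
            "exons": exons,
            "aliases": [],   # transcript names from all sources
            "sources": [],   # where each alias came from
        })
        entry["aliases"].append(rec["name"])
        entry["sources"].append(rec["source"])
    return list(merged.values())
```

Keying on the full exon structure keeps the sketch simple; a real import step may need a more tolerant comparison.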

Of note, each of the input sources uses a different naming scheme. LncRNA
researchers have previously used the gene symbol of the nearest protein
coding gene to refer to a given lncRNA (15). Based on this strategy, we
have implemented a universal lncRNA nomenclature to ease communication
among researchers. Different lncRNA transcripts are considered to belong
to the same gene if they share at least one (partially) overlapping exon
and reside on the same DNA strand. In this way, transcripts are clustered
into genes. These lncRNA genes are then named after the HUGO symbol of
the nearest protein-coding gene on the same strand using the following
scheme: 'lnc-HUGO-#'. The lncRNA genes are numbered, starting with the
lncRNA gene closest to the protein-coding gene. A second number is added
to denote the different transcript variants, starting with the most
upstream transcript; for example, lnc-MYCN-1:1 denotes transcript 1 from
gene lnc-MYCN-1 (Figure 2).
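
The clustering and naming rules described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Transcript` class and all function names are hypothetical, the nearest protein-coding gene (its symbol and genomic span) is assumed to be known already, and all transcripts passed to `name_clusters` are assumed to lie on the same strand as that gene.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transcript:
    tid: str
    chrom: str
    strand: str
    exons: tuple  # ((start, end), ...), inclusive coordinates

def same_gene(a, b):
    """Same chromosome and strand, and at least one (partially) overlapping exon."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons for s2, e2 in b.exons)

def cluster_transcripts(transcripts):
    """Single-linkage clustering of transcripts into genes (union-find)."""
    parent = list(range(len(transcripts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if same_gene(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)
    clusters = {}
    for i, t in enumerate(transcripts):
        clusters.setdefault(find(i), []).append(t)
    return list(clusters.values())

def cluster_span(cluster):
    return (min(s for t in cluster for s, _ in t.exons),
            max(e for t in cluster for _, e in t.exons))

def name_clusters(clusters, hugo, gene_span):
    """Assign 'lnc-HUGO-#:#' names: genes by distance, transcripts by 5' position."""
    def distance(cluster):
        (s, e), (gs, ge) = cluster_span(cluster), gene_span
        return max(0, gs - e, s - ge)

    def five_prime(t):
        # Most upstream first: smallest start on '+', largest end on '-'.
        if t.strand == "+":
            return min(s for s, _ in t.exons)
        return -max(e for _, e in t.exons)

    names = {}
    for g, cluster in enumerate(sorted(clusters, key=distance), 1):
        for n, t in enumerate(sorted(cluster, key=five_prime), 1):
            names[t.tid] = f"lnc-{hugo}-{g}:{n}"
    return names
```

The pairwise comparison is quadratic in the number of transcripts, which is acceptable for a sketch; an interval index per chromosome and strand would be the natural choice for a full catalogue.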

INTEGRATED ANALYSIS TOOLS 

LncRNA-protein interactions are, in part, mediated by the secondary
structure of the lncRNA. The Vienna RNA package (16,17) consists of a set
of algorithms for predicting and analysing RNA secondary structures. We
applied the RNAfold algorithm to generate a secondary structure plot and
a dot plot with pair probabilities. Both of these images are processed
with the provided relplot.pl script to obtain a structure plot with
colour-annotated base pair probabilities. The output PostScript (.ps)
images are converted to the graphics interchange format (.gif) for
display in web browsers.

Structural RNAs, such as miRNAs, have a significantly lower minimum free
energy of folding compared with randomly shuffled sequences (18). The
Randfold algorithm implements the randomization test and returns the
minimum free energy of folding and P-value for every RNA sequence. Hence,
a significant P-value denotes a high propensity in the sequence towards a
stable secondary structure.
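
The logic of such a randomization test can be sketched in Python. The sketch makes several assumptions that should be kept in mind: `toy_energy` is a hypothetical stand-in for a real folding energy (it merely counts hairpin-like complementary pairs so that the block runs without external software), simple mononucleotide shuffling is used, and the P-value is computed with the common (R + 1)/(N + 1) convention, which may differ from the exact formula used by Randfold.

```python
import random

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}

def toy_energy(seq):
    """Hypothetical stand-in for a minimum free energy.

    Counts Watson-Crick pairs between position i and its mirror position,
    and returns the negated count (lower means 'more stable').
    """
    half = len(seq) // 2
    return -sum((seq[i], seq[-1 - i]) in PAIRS for i in range(half))

def randomization_p_value(seq, energy_fn=toy_energy, n_shuffles=1000, seed=0):
    """Return (observed energy, P-value) from a shuffling test.

    The P-value is the fraction of shuffled sequences whose energy is at
    least as low as that of the original sequence.
    """
    rng = random.Random(seed)
    observed = energy_fn(seq)
    letters = list(seq)
    as_low = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves nucleotide composition
        if energy_fn("".join(letters)) <= observed:
            as_low += 1
    return observed, (as_low + 1) / (n_shuffles + 1)
```

In practice, `energy_fn` would wrap a real folding program; the test logic itself is unchanged by that substitution.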

Recently, it has been shown that lncRNAs can act as a miRNA sponge by
binding specific microRNAs and, thus, interfering with their role as
negative regulators of gene expression (5,19,20). We include miRNA seed
predictions for every lncRNA to allow researchers to evaluate possible
miRNA-lncRNA interactions. miRNA seed predictions were performed using
the MirTarget2 algorithm (21).
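
To make the notion of a seed site concrete, the following Python sketch locates perfect matches to a miRNA seed in a transcript. It is deliberately much simpler than MirTarget2, whose scoring is not reproduced here: the sketch only searches for the reverse complement of miRNA nucleotides 2-8, and the function names and the example sequences are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_site(mirna):
    """Target motif complementary to miRNA nucleotides 2-8 (7-mer seed)."""
    seed = mirna.upper().replace("T", "U")[1:8]
    return seed.translate(COMPLEMENT)[::-1]

def find_seed_matches(lncrna, mirna):
    """Return 0-based start positions of perfect seed sites in the transcript."""
    target = lncrna.upper().replace("T", "U")
    site = seed_site(mirna)
    hits = []
    pos = target.find(site)
    while pos != -1:
        hits.append(pos)
        pos = target.find(site, pos + 1)  # allow overlapping sites
    return hits

# Illustrative input only: an example miRNA and a made-up transcript fragment.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_transcript = "AAACTACCTCGGGCUACCUCAA"
print(seed_site(example_mirna))
print(find_seed_matches(example_transcript, example_mirna))
```

A perfect seed match is only a candidate site; a dedicated prediction algorithm weighs many additional features before reporting a score.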



Table 1. The different sources of lncRNA transcripts used for
LNCipedia at the time of development (a)

Source                                Version           Number of transcripts
Ensembl (biotype = lincRNA)           Version 64          9069
Human bodymap lincRNAs (2)                               14 279
LncRNAdb (10)                         September 2011       134
Total number of unique transcripts                       21 488

(a) The database will be updated with new transcripts when new versions
of the sources are released.



Figure 1. LNCipedia is generated in a multistep process that comprises
importing, naming, analysis and visualization of lncRNA genes. Import
scripts for the FASTA, BED and GFF file formats process lncRNA
transcripts and detect redundancy. LncRNA naming is preceded by the
creation of lncRNA transcript clusters and requires information on the
nearest protein-coding gene on the same DNA strand. Every lncRNA
transcript is subsequently analysed using multiple algorithms, and the
results are appended to the database. A web interface built using Perl
enables lncRNA visualization and database querying.

[Figure 2 image: locus diagram showing the genes lnc-SOX1-3, SOX1
(protein coding), lnc-SOX1-1 and lnc-SOX1-2 with their numbered
transcripts, e.g. lnc-SOX1-3:1 to lnc-SOX1-3:5.]

Figure 2. The SOX1 protein-coding gene locus contains three lncRNAs on
the same DNA strand, numbered according to their distance in relation to
SOX1. LncRNA transcripts are numbered according to their order in the
gene, starting with the most upstream transcript.

PROTEIN CODING POTENTIAL

Assessment of protein coding potential is an important aspect in the
study of non-coding RNAs. LNCipedia reports the outcome of two different
protein coding potential prediction algorithms. The Coding Potential
Calculator (CPC) applies a support vector machine classifier to the
output of open reading frame analysis and Basic Local Alignment Search
Tool search (22). CPC returns the predicted status of the transcript
(coding/non-coding) and a coding potential score. We applied version 0.9
of the CPC software and report the predicted status and the coding
potential score for every transcript. Another popular strategy for
detection of coding sequences is based on known protein domains. The
HMMER3 suite provides software based on hidden Markov models for
sequence-based homology searches (23). It is often used in combination
with the Pfam protein families database (24). Using the hmmscan
algorithm, we searched for Pfam protein domains in the RNA sequence. All
six reading frames were translated in silico, and the numbers of hits in
the 5' to 3' and 3' to 5' directions are reported.
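
The in silico translation of all six reading frames can be sketched in Python as shown below. This is a minimal illustration that assumes the standard genetic code and hypothetical function names; it is not the authors' code, and the subsequent domain search (hmmscan against Pfam) is not shown.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(rna):
    """Return translations keyed by frame: '+1'..'+3' (5' to 3') and '-1'..'-3'."""
    dna = rna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings could then be written to a FASTA file and searched for protein domains, with hits in the '+' and '-' frames counted separately.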

A unique feature of LNCipedia is the incorporation of an automated
reprocessing pipeline that relies on publicly available fragmentation
spectra from the PRIDE database at EMBL-EBI (25) to detect potentially
coding lncRNAs. The concept behind this feature is that mass
spectrometry-based proteomics data may contain serendipitously recorded
mass spectra derived from translated lncRNAs. As standard identification
strategies in proteomics are based on searching these spectra against
protein sequence databases, such as UniProtKB/Swiss-Prot (26), they are
inherently unable to detect coding forms of lncRNAs, as these are not
present in such databases. To uncover such potential traces of coding
lncRNAs, the spectra, thus, need to be re-searched against a
purpose-built database that comprises a combination of the possible
translations of known lncRNAs, the known proteins for that organism as
obtained from a traditional sequence database and corresponding decoy
sequences for both these constituent databases for quality control and
false discovery rate (FDR) estimation purposes (27). A spectrum can,
thus, be matched against a lncRNA, a known protein or a decoy sequence.
The known proteins must be included to prevent relatively low-scoring
matches of spectra against lncRNAs from being picked up where a much
better match for that spectrum can be found for a known protein.

We have implemented such a pipeline by using the SearchGUI tool (28) to
run the X!Tandem (29) search algorithm. All results are then collated and
filtered at 1% FDR by the PeptideShaker algorithm
(http://code.google.com/p/peptide-shaker). The pipeline infers the
original search parameters, such as mass errors and post-translational
modifications, both directly from the PRIDE database and by using the
PRIDE automatic spectrum annotation pipeline
(http://code.google.com/p/pride-asa-pipeline). All the tools and
algorithms used are freely available as open source.

[Figure 3 image: screenshot of the transcript page for lnc-SMUG1-3:6
(alternative gene name HOTAIR), with the sections Basic information, RNA
sequence, Structure, Protein coding potential, PRIDE database search,
Secondary structure information, Targeting miRNAs and Available
literature.]

Figure 3. The transcript page in the web interface provides a clear
overview of information available on a specific lncRNA transcript.

The pipeline has so far been run on 149 PRIDE experiments from at least
15 different tissues, yielding 81 579 peptide-to-spectrum matches (PSMs)
against the custom-built protein sequence database that includes
UniProtKB/Swiss-Prot and LNCipedia translations (Supplementary Figure
S1). Within these PSMs, there were just 14 matches that could provide
evidence for translation of LNCipedia entries. However, after close
inspection of the FDR of the PSMs that passed our quality criteria, we
noticed that although the PSMs from UniProtKB/Swiss-Prot have an expected
FDR of 0.9%, the subset of PSMs from translated LNCipedia entries comes
with an overwhelming FDR of 166% (Supplementary Figure S2). As such,
there are only vague suggestions so far that any of these entries can
effectively be translated.
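
An estimated FDR above 100% may look paradoxical, but it follows naturally from a target-decoy estimate computed per subset. The Python sketch below assumes the simple estimator FDR = (decoy hits) / (target hits), which is one common convention and not necessarily the exact formula used in the pipeline. The field names and all counts are made up for illustration and are not the study's data.

```python
from collections import Counter

def subset_fdr(psms):
    """Estimate the FDR separately for each constituent database.

    Each PSM is a dict with a 'source' (which database the matched
    sequence belongs to) and an 'is_decoy' flag. The estimate is
    decoy hits / target hits within that source.
    """
    counts = Counter((p["source"], p["is_decoy"]) for p in psms)
    fdr = {}
    for source in {p["source"] for p in psms}:
        targets = counts[(source, False)]
        decoys = counts[(source, True)]
        fdr[source] = decoys / targets if targets else float("inf")
    return fdr

# Hypothetical counts, chosen only to show the effect.
example_psms = (
    [{"source": "swissprot", "is_decoy": False}] * 2000
    + [{"source": "swissprot", "is_decoy": True}] * 20
    + [{"source": "lncipedia", "is_decoy": False}] * 4
    + [{"source": "lncipedia", "is_decoy": True}] * 6
)
for source, value in sorted(subset_fdr(example_psms).items()):
    print(f"{source}: estimated FDR = {value:.1%}")
```

Whenever decoy hits outnumber target hits within a subset, the estimate exceeds 100%, which indicates that the target hits in that subset are indistinguishable from random matches.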

As the PRIDE database is growing exponentially, and additional lncRNA
transcript discovery is ongoing, searches for potentially coding lncRNAs
need to be carried out anew at regular intervals to stay up-to-date with
the growing amount of public data. We, therefore, envision running the
full pipeline on all applicable PRIDE data at a set interval of 3 months,
thus periodically updating the knowledge on which lncRNAs might have
coding potential. The output of each reprocessing effort will be used to
annotate LNCipedia, and past results will be kept available as well.

Besides this recurrent re-analysis of the relevant publicly available
proteomics data, we also plan to extend the statistical approach used to
evaluate the identification of a lncRNA by including information about
the consistency with which such an identification is found across
(unrelated) PRIDE experiments. Indeed, a relatively poor match in any
individual experimental data set that, however, keeps returning across
many such data sets may well be a real indication that translation is
taking place for that lncRNA.



LNCIPEDIA ACCESS

LNCipedia is publicly available through a web interface at 
http://www.lncipedia.org. The interface allows users to 
query lncRNAs by name, chromosomal region or
(partial) sequence. Several statistics are calculated that
allow the user to evaluate different parameters regarding
lncRNA secondary structure and regulation (Figure 3).
The entire LNCipedia collection is available for 
download in the FASTA, GFF or BED format. 

LncRNA researchers can contribute to LNCipedia by 
contacting the authors. In addition, registered users can 
modify existing records (updating aliases and adding 
PubMed literature records) directly using a web interface. 

LNCRNA EXPRESSION ARRAY 

The LNCipedia content can prove useful when designing large-scale
screening experiments, such as lncRNA gene expression profiling. As a
proof of concept, we have developed a custom lncRNA gene expression array
using the Agilent SurePrint 60k platform. In addition to roughly 33 000
probes for protein coding genes, we selected 23 042 probes for lncRNA
transcripts in LNCipedia, covering 97% of all LNCipedia transcripts with
at least one probe (Agilent MicroArray Design ID: 039714). The
performance of the expression array was evaluated using RNA sample
titrations according to the MicroArray Quality Control standards (30).
Adequate titration response of the lncRNA probes is shown in
Supplementary Figure S3.

CONCLUSION AND FUTURE DIRECTION 

Three important features are unique to LNCipedia: gene definitions and
usage of a universal nomenclature for lncRNA transcripts, PRIDE analysis
for detection of lncRNAs that may code for small peptides and miRNA seed
predictions for lncRNA transcripts. These, along with the other tools
available, are expected to make LNCipedia a powerful resource for human
lncRNA research.

With the advances in RNA sequencing technology, more lncRNA genes are
expected to be discovered. The authors will update LNCipedia when new
sequences are reported in the literature or in other sources. In
addition, new features will be developed to increase the interactive
capabilities of LNCipedia. In this way, the lncRNA community will be able
to upload and maintain records in the database. LNCipedia has the
potential to become a community resource for lncRNA transcript
information and annotation.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Figures 1-3 and Supplementary Methods. 